Qualcomm AI Engine Direct - remove prefill calibration #17805
haowhsu-quic wants to merge 1 commit into pytorch:main
Conversation
- calibrate kv text decoder only to reduce calibration time
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17805
Note: Links to docs will display an error until the docs builds have been completed. ❌ 4 New Failures as of commit 311249c with merge base 0c2ff55.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Thanks a lot! In addition to this, I noticed that SeqMSE grid searching is done sequentially instead of in parallel. Is there room to improve there?
Yes, will look into it. |
Can you share what's broken? Is it the flow that consumes torchtune in the executorch repo?
```diff
  # transpose first to decrease the runtime efforts
  k_cache.append(
-     torch.zeros(
+     torch.ones(
```
Are we initializing the kv cache with different values?
This is for checking the numerical values with a deterministic input for validation purposes. I can revert them back to zeros.
```python
if list(decode_node.users)[0].target in ptq_target:
    activation_override(decode_node, prefill_node)

# copy encoding for hybrid mode
```
Are you copying over the quantization parameters from kv mode to prefill mode?
Yes, since the current bottleneck of calibration time is using prefill mode to generate user prompts.
I think the future scenario we want to try is to generate all the special tokens for the target model and merge them with the task calibration data, have the prefill model iterate over it, and use prefill's encodings for kv. We could skip the user prompt generation with this approach.
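The encoding copy shown in the diff above can be sketched roughly as follows. This is an illustration only: `activation_override`, the annotation key name, and the toy graphs are assumptions based on the diff context, not the backend's actual implementation.

```python
import torch
from torch import fx

class TinyBlock(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

# two structurally identical graphs standing in for decode / prefill
decode_g = fx.symbolic_trace(TinyBlock())
prefill_g = fx.symbolic_trace(TinyBlock())

# assumed key name, standing in for the slot pt2e-style flows store annotations under
KEY = "quantization_annotation"

def activation_override(decode_node, prefill_node, key=KEY):
    """Share decode's already-calibrated activation annotation with prefill."""
    if key in decode_node.meta:
        prefill_node.meta[key] = decode_node.meta[key]

# pretend decode was calibrated, then copy its encodings over node-by-node
for d, p in zip(decode_g.graph.nodes, prefill_g.graph.nodes):
    if d.op == "call_function":
        d.meta[KEY] = {"scale": 0.02, "zero_point": 0}
    activation_override(d, p)
```

This way prefill never needs its own calibration pass; it just inherits the observed ranges decode already collected.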
```python
#
# however, pytorch will use different computation kernels for different
# workloads (AR1 vs ARN) which will introduce some numerical discrepancy.
#
```
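The AR1-vs-ARN remark in the comment above can be demonstrated directly: running attention for a single query position (decode-style) versus the full sequence (prefill-style) is mathematically identical for that position, but the two shapes may dispatch to different kernels, so the results only agree up to float tolerance. The shapes below are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 4, 8, 16)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 4, 8, 16)
v = torch.randn(1, 4, 8, 16)

# ARN (prefill-style): all 8 query positions at once
arn = F.scaled_dot_product_attention(q, k, v)
# AR1 (decode-style): only the last query position, same keys/values
ar1 = F.scaled_dot_product_attention(q[:, :, -1:], k, v)

# Same math for the last position, but not guaranteed bit-identical
print(torch.allclose(arn[:, :, -1:], ar1, atol=1e-4))
```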
What is the mechanism to make sure the encodings align correctly?
I'm worried about the accuracy too if we get rid of prefill calibration. Do you think that generating prompt + output using the fp32 model (pre-observers), as discussed in PR #17786, and running prefill + decode as before with `skip_generate`, might yield better accuracy rather than getting rid of the entire prefill calibration?
Prefill calibration ideally is not needed because decode sees all the generated tokens too, and the prefill graph and decode graph should be the same. I remember @haowhsu-quic mentioned we insert the kv cache output of prefill and connect it to the decode input to make sure those quant nodes are also calibrated. I did a comparison of quant params between prefill and decode in the past and they were very, very close. I'm trying to figure out whether this PR handles the kv cache differently than before.
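The prefill-vs-decode quant-param comparison described above amounts to walking the two calibrated graphs in lockstep and checking how close each pair of encodings is. A minimal sketch, with illustrative names and made-up example values (the real encodings come from the calibrated graphs):

```python
def compare_encodings(prefill_encodings, decode_encodings, rtol=1e-2):
    """Each arg: list of (name, scale, zero_point) in topological order.
    Returns the pairs whose scales differ beyond rtol or whose zero_points differ."""
    mismatches = []
    for (pn, ps, pz), (dn, ds, dz) in zip(prefill_encodings, decode_encodings):
        if abs(ps - ds) > rtol * max(abs(ps), abs(ds)) or pz != dz:
            mismatches.append((pn, dn))
    return mismatches

# made-up encodings that are "very, very close", as observed in the comparison
prefill = [("attn.q", 0.0200, 0), ("attn.k", 0.0310, 0)]
decode  = [("attn.q", 0.0201, 0), ("attn.k", 0.0309, 0)]
print(compare_encodings(prefill, decode))  # -> []
```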
I guess my question is that prefill "sees" previous tokens, and the attention block will take these into consideration while generating the kv-cache.
For weights, what you said makes sense as they are idempotent from the math standpoint; maybe we should just check the PPL on-device?
CC: @metascroy, @kimishpatel if you have any thoughts on this.
The mechanism is to compare the topological order of ops in both graphs, where each op at the same position should have an identical nn_module_stack. The number and type of QDQ pairs among each node's (call_function / placeholder) users are required to be identical as well.
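A rough sketch of that alignment check: walk both graphs in topological order and require the same op sequence with matching nn_module_stack metadata at each position. The helper name is illustrative, and the `symbolic_trace` demo graphs below carry no nn_module_stack (exported/pt2e graphs would), so the demo mainly exercises the topology comparison.

```python
import torch
from torch import fx

def graphs_align(graph_a, graph_b):
    """Illustrative check: same op count and same nn_module_stack per position.
    (The real check also compares the QDQ users of each node.)"""
    keep = ("call_function", "placeholder")
    nodes_a = [n for n in graph_a.nodes if n.op in keep]
    nodes_b = [n for n in graph_b.nodes if n.op in keep]
    if len(nodes_a) != len(nodes_b):
        return False
    return all(
        a.meta.get("nn_module_stack") == b.meta.get("nn_module_stack")
        for a, b in zip(nodes_a, nodes_b)
    )

class WithRelu(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

class DoubleRelu(torch.nn.Module):
    def forward(self, x):
        return torch.relu(torch.relu(x))

g1 = fx.symbolic_trace(WithRelu()).graph
g2 = fx.symbolic_trace(WithRelu()).graph   # same topology -> aligns
g3 = fx.symbolic_trace(DoubleRelu()).graph # extra op -> does not align
print(graphs_align(g1, g2), graphs_align(g1, g3))
```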
Will have a fix for the condition `prefill_ar_len == max_seq_len`, thanks for identifying this.
Ran the comparison over the weekend and observed great speedups: calibration time is cut in half, and overall time is cut by ~33% for the Qwen model I'm benchmarking. I will do a final PPL check on-device and we will be good to go. Thanks again for working on this.
I think we should also find ways to cut the per-iteration decode timing.
┌─────────────────────┬──────────┬────────┬─────────────────────┐
│ │ Baseline │ PR │ Savings │
├─────────────────────┼──────────┼────────┼─────────────────────┤
│ DECODE calibration │ 2h 59m │ 2h 49m │ ~10m (noise) │
├─────────────────────┼──────────┼────────┼─────────────────────┤
│ PREFILL calibration │ 2h 34m │ 15.5s │ 2h 34m │
├─────────────────────┼──────────┼────────┼─────────────────────┤
│ Quantization total │ 5h 33m │ 2h 50m │ 2h 43m (49% faster) │
├─────────────────────┼──────────┼────────┼─────────────────────┤
│ Compile total │ 2h 35m │ 2h 35m │ ~0 (noise) │
├─────────────────────┼──────────┼────────┼─────────────────────┤
│ End-to-end │ 8h 13m │ 5h 30m │ 2h 43m (33% faster) │
└─────────────────────┴──────────┴────────┴─────────────────────┘
@haowhsu-quic PPL looks fine.
Let's go with the change. Can you please rebase?
> The mechanism is to compare the topological order of ops in both graphs, where each op at the same position should have an identical nn_module_stack. The number and type of QDQ pairs among each node's (call_function / placeholder) users are required to be identical as well.
Is there any particular reason that we removed the custom annotation? I understand that the custom annotation caused prefill/decode to have different graphs, but what about the quantization params for the kv cache?
I think quantization params for placeholders (kv cache) will also be shared between prefill & decode. They should be the same under identical calibration data. May I learn more about your concern?
I think we're using torchtune to convert parameter naming in some llms. Since torchao has deprecated the
Summary
Total Quantization Time
Test plan
```shell
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleMultimodalityScript
```